Abstract

We study the primary DNA structure of four of the most completely
sequenced human chromosomes (including chromosome 19, which is the
most dense in coding), using Non-extensive Statistics. We show that
the exponents governing the decay of the coding size distributions lie
in the ranges 5.2 < r < 5.7 for the short scales and 1.45 < q < 1.50 for
the large scales. In contrast, the exponents governing the decay of
the non-coding size distributions in these four chromosomes take the
values 2.4 < r < 3.2 for the short scales and 1.50 < q < 1.72 for the
large scales. This quantitative difference, in particular in the tail
exponent q, indicates that the non-coding (coding) size distributions
have long (short) range correlations. This non-trivial difference in
the DNA statistics is attributed to the non-conservative (conservative)
evolution dynamics acting on the non-coding (coding) DNA sequences.

PACS Numbers: 89.75.-k (Systems Obeying Scaling Laws); 87.17.Gg 
(DNA, RNA); 05.65.+b (Self-organised Systems). 
1 Introduction

During recent years numerous studies on the statistics of genomic sequences
have demonstrated various degrees of complexity in the primary structure
of DNA. In particular, Peng et al. in 1992 demonstrated the existence of
long range correlations using the "DNA walk" model [1]. Similar conclusions
were reached by Li et al. [2] and Voss [3] using the 1/f spectrum and later by
studies on the size distribution of Purine (Adenine, Guanine) and Pyrimidine
(Thymine, Cytosine) clusters in coding and non-coding regions of different
organisms [4, 5]. Other studies manifested long range correlations and power
laws in the primary structure of DNA using a variety of statistical methods
ranging from wavelets to linguistic approaches [6, 7, 8, 9, 10, 11].

In recent studies, one of the present authors (AP) and coworkers have
shown that the long range distributions of Purine and Pyrimidine clusters in
the non-coding regions of higher eucaryotes are related to similar long range
distributions present at a higher level of genomic organisation: the level of
coding and non-coding alternating regions [12, 13, 14].

Non-extensive Statistical Mechanics is particularly suited to describing
complex structures which present long range correlations, power laws and
fractality [15]. In particular, non-extensive statistics have been used
to describe successfully complex spatiotemporal structures in diverse
fields such as high energy physics [16], turbulence [17], biological
systems [18], anomalous diffusion [19], classical and quantum chaos [20],
interacting particle systems [21] and reactive dynamics [22].

Classical Statistical Mechanics uses the Boltzmann Gibbs (BG) Entropy,
$S_{BG}$, defined as:

$$ S_{BG} = -\sum_{i=1}^{W} p_i \ln p_i \qquad (1) $$

to describe the properties of systems at equilibrium. In Eq. (1), $p_i$
denotes the probability of the $i$-th microscopic state and the sum runs
over the total number of states $W$. This BG entropic form cannot
successfully describe systems in which self-organisation, long range
features and scaling are observed. As a generalisation of Eq. (1), Tsallis
and coworkers [23] have
introduced the non-extensive entropy, defined as:

$$ S_q = \frac{1 - \sum_{i=1}^{W} p_i^q}{q - 1}, \quad \text{for } q \neq 1 \qquad (2) $$

where q is the non-extensivity exponent. Note that in the limit $q \to 1$
the classical BG statistics (Eq. (1)) is recovered, and thus departure of
the exponent q from the value 1 signals departure from BG statistics.
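
As a small numerical illustration of Eqs. (1) and (2), the sketch below
evaluates both entropies for an arbitrary probability vector and prints
$S_q$ for values of q approaching 1. The function names and the test
distribution are illustrative assumptions and are not part of the original
analysis.

```python
import math

def boltzmann_gibbs_entropy(p):
    """S_BG = -sum_i p_i ln p_i (Eq. 1); zero-probability states contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tsallis_entropy(p, q):
    """S_q = (1 - sum_i p_i^q) / (q - 1) (Eq. 2); falls back to S_BG at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return boltzmann_gibbs_entropy(p)
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)

p = [0.5, 0.25, 0.125, 0.125]   # arbitrary illustrative distribution
s_bg = boltzmann_gibbs_entropy(p)
for q in (1.5, 1.1, 1.01, 1.001):
    # The second column should approach the third as q approaches 1.
    print(q, tsallis_entropy(p, q), s_bg)
```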

In relation to non-extensive statistics, long range decay may be obtained
by a non-linear dynamical process expressed by [24]:

$$ \frac{d\xi}{ds} = -\kappa_q \xi^q, \quad \text{for } (\kappa_q > 0,\ q \geq 1) \qquad (3) $$

In particular, for q > 1 long range decay is manifested, while for q = 1 the
well known exponential decay is obtained. The solution of Eq. (3) is:

$$ \xi(s) = \left[ 1 - (1-q)\,\kappa_q (s-1) \right]^{1/(1-q)}, \quad \text{for } (\kappa_q > 0,\ q > 1) \qquad (4) $$
$$ \xi(s) = \exp\left( -\kappa_1 (s-1) \right), \quad \text{for } (\kappa_1 > 0,\ q = 1) $$

with initial condition $\xi(1) = 1$. Thus for q > 1 a long range law (power law
decay) is obtained, while for q = 1 a short range law (exponential decay)
emerges.
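
The behaviour of the solution in Eq. (4) can be checked numerically. The
following minimal sketch evaluates the solution and its local logarithmic
slope, which for q > 1 should approach -1/(q-1) at large s. The parameter
values are illustrative assumptions, not fitted values.

```python
import math

def xi(s, q, kappa):
    """Solution of d(xi)/ds = -kappa * xi**q with xi(1) = 1 (Eq. 4)."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(-kappa * (s - 1.0))
    return (1.0 - (1.0 - q) * kappa * (s - 1.0)) ** (1.0 / (1.0 - q))

def local_slope(s, q, kappa, h=1e-4):
    """Logarithmic slope d ln(xi) / d ln(s), by central differences."""
    up, down = xi(s * (1 + h), q, kappa), xi(s * (1 - h), q, kappa)
    return (math.log(up) - math.log(down)) / (math.log(1 + h) - math.log(1 - h))

q, kappa = 1.5, 0.01   # illustrative values, not fitted ones
for s in (1e2, 1e4, 1e6):
    # For q = 1.5 the slope should tend towards -1/(q-1) = -2 as s grows.
    print(s, xi(s, q, kappa), local_slope(s, q, kappa))
```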

For phenomena where two or more dynamical mechanisms act in the system,
producing different decay laws at different length (and/or time) scales,
a further phenomenological generalisation of Eq. (3) may be introduced by
addition of terms carrying different powers [21]. The simplest one carries
only one additional term and is:

$$ \frac{d\xi}{ds} = -\kappa_q \xi^q - (\lambda_r - \kappa_q)\,\xi^r, \quad \text{for } (q < r) \qquad (5) $$

Note that Eq. (3) is recovered for $\kappa_q = \lambda_r$ ($\forall r$). The
solution of Eq. (5) cannot be written in a simple form, but it may be shown
that it consists of two distinct power law regions, one governed by the
exponent q and one by the exponent r [24].
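
The two power law regions of Eq. (5) can be made visible by integrating the
equation numerically. The sketch below uses a fourth-order Runge-Kutta
scheme in the variable u = ln(s) and reads the local log-log slope directly
off the equation. The integration scheme, the step size and all parameter
values are assumptions chosen for illustration; they are not the fitted
values reported in the tables.

```python
import math

def rhs(xi, q, r, kappa_q, lambda_r):
    """Right-hand side of Eq. (5): d(xi)/ds."""
    return -kappa_q * xi ** q - (lambda_r - kappa_q) * xi ** r

def integrate_log_s(q, r, kappa_q, lambda_r, s_max, du=1e-3):
    """RK4 in u = ln(s), starting from xi(1) = 1; returns lists of s and xi."""
    def f(u, xi):
        # Chain rule: d(xi)/du = s * d(xi)/ds with s = exp(u).
        return math.exp(u) * rhs(xi, q, r, kappa_q, lambda_r)
    u, xi = 0.0, 1.0
    s_values, xi_values = [1.0], [1.0]
    while u < math.log(s_max):
        k1 = f(u, xi)
        k2 = f(u + du / 2, xi + du * k1 / 2)
        k3 = f(u + du / 2, xi + du * k2 / 2)
        k4 = f(u + du, xi + du * k3)
        xi += du * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        u += du
        s_values.append(math.exp(u))
        xi_values.append(xi)
    return s_values, xi_values

def log_slope(s, xi, q, r, kappa_q, lambda_r):
    """Local slope d ln(xi) / d ln(s), read directly off the ODE."""
    return s * rhs(xi, q, r, kappa_q, lambda_r) / xi

# Illustrative parameters only (not the fitted values of the tables).
q, r, kappa_q, lambda_r = 1.5, 3.0, 1e-3, 1.0
s_values, xi_values = integrate_log_s(q, r, kappa_q, lambda_r, s_max=1e8)
for target in (1e2, 1e8):
    i = min(range(len(s_values)), key=lambda j: abs(math.log(s_values[j] / target)))
    print(target, log_slope(s_values[i], xi_values[i], q, r, kappa_q, lambda_r))
```

With these parameters the slope at intermediate s should be close to
-1/(r-1) and the slope at the largest s close to -1/(q-1), so that the
exponent r controls the short scales and q the tail.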

In Fig. 1 we present the size distribution of coding and non-coding DNA
sequences in chromosome 16. To avoid local fluctuations, running averages
are considered over 15 Base Pairs (bps). For clarity only the first 1000
points are shown. The maximum size of coding regions is of the order of
7000-8000 bps (it reaches 20000 bps for chromosome 19), while the maximum
sizes of the non-coding regions reach $10^8$ bps. The coding size
distributions are rich in small segments of the order of 100-110 bps and
then fall fast, while the non-coding ones have a similar maximum in the
small scales and fall relatively slower. For comparison we also present the
size distribution of chromosome 17 in Fig. 2, in double logarithmic scale,
where the entire s-range is shown.

Figure 1: Running average over 15 points of the size distribution of coding
and non-coding DNA in chromosome 16. Only first 1000 bps are shown.
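
The smoothing used for the size distribution of chromosome 16 can be
sketched as follows. The text specifies only a running average over 15
points; the centred window, its truncation at the ends and the toy data
are assumptions of this example.

```python
from collections import Counter

def size_histogram(sizes):
    """Number of segments N(s) for each integer size s = 1 .. max(sizes)."""
    counts = Counter(sizes)
    return [counts.get(s, 0) for s in range(1, max(sizes) + 1)]

def running_average(values, window=15):
    """Centred moving average; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

sizes = [98, 101, 101, 104, 110, 110, 110, 125, 300]   # toy segment sizes in bps
hist = size_histogram(sizes)          # hist[i] counts segments of size i + 1
smooth = running_average(hist, window=15)
```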

Comparing Figs. 1 and 2 we note that the size distribution of non-coding
DNA has a complex form, but we may clearly distinguish two regions:
one region at the short scales which is bell-shaped and which describes the
introns (non-coding regions within genes), and one region at the larger
scales which contains a long tail and which describes the non-coding
intergenic regions. It is thus natural, at the phenomenological level, to
use the dynamics of Eq. (5) for the description of the complex shape of the
size distribution of non-coding DNA, hoping to capture these two trends,
the introns and the intergenic regions.

In the current study we use non-extensive statistics to study the size
distributions of coding and non-coding sequences in the human genome, which
is now near completion. We have selected to study 4 of the most complete
human chromosomes, including chromosome 19 which contains the highest
percentage of coding. In the next section we concentrate on the primary
structure of the human genome and we give details on the particular data
we use. In sections 3 and 4 we present the analysis of the size distribution
of coding and non-coding DNA, respectively. We conclude by summarising
our main results and discussing some open problems.

Figure 2: The size distribution of non-coding DNA in chromosome 17 in a
double logarithmic scale (all data).



2 The Human Genome 

Although officially the human genome project is announced to be near
completion, in the international EMBL and GenBank genomic data bases the
sequence data deposited varies from 98.91% for chromosome 17 to 43.1% for
chromosome Y. The unknown base pairs are usually denoted by the letter
N (unknown base pair) and they are either isolated or appear in clusters.
The meaning of N is not unique. It might denote a base pair which resists
sequencing methods completely or partially. Resisting partially means
that partial information on the base is known, for example that it is a
Purine or a Pyrimidine. Another case is that the various laboratories which
verify the sequencing may not agree on this base pair.
In the current project we analyse the complete primary structures of
human chromosomes 6, 16, 17, for which the N percentage is the smallest,
and also chromosome 19, which contains the highest percentage of coding
DNA, 3.8%. The sequenced percentage presented in the data bases and the
coding percentage of these are shown in Table 1. After downloading the
chromosomes we isolate the coding and non-coding segments and calculate
their respective size distributions for each of them. A representative plot
is shown in Fig. 1. Due to the heavy fluctuations in the data we prefer to
work with the cumulative distributions P(s) defined as:

$$ P(s) = \int_s^{\infty} P(l)\, dl \qquad (6) $$

where P(l) is the usual distribution of coding or non-coding regions of
size l. In general, due to summation the cumulative distributions have
better statistical properties than the usual distribution functions while
they keep the main data trends. Notice that, if the distribution P(l) has
the exponential (short range) form, its cumulative P(s) will also have the
exponential form. If the distribution function has a power law form of the
type:

$$ P(l) \sim l^{-1-\mu} \qquad (7) $$

then the cumulative distribution will have a power law form with exponent
$\mu$:

$$ P(s) \sim \int_s^{\infty} l^{-1-\mu}\, dl \sim s^{-\mu}, \quad 0 < \mu < 2. \qquad (8) $$
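
The relation between Eqs. (7) and (8) can be illustrated on synthetic data.
The sketch below draws sizes from a pure power law density, builds the
empirical cumulative distribution (the fraction of segments of size at
least s) and estimates its log-log slope by least squares. The sampling
scheme, sample size and slope estimator are assumptions of this example and
are not the procedure used for the genomic data.

```python
import bisect
import math
import random

def cumulative_distribution(sizes, s_values):
    """Empirical P(s): fraction of segments whose size is >= s (discrete analogue of Eq. 6)."""
    ordered = sorted(sizes)
    n = len(ordered)
    return [(n - bisect.bisect_left(ordered, s)) / n for s in s_values]

def loglog_slope(xs, ys):
    """Least-squares slope of ln(y) against ln(x)."""
    lx, ly = [math.log(x) for x in xs], [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    return sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)

random.seed(0)
mu = 1.5                                   # illustrative tail exponent
# Inverse-transform sampling: sizes with density ~ l**(-1 - mu) for l >= 1 (Eq. 7).
sizes = [(1.0 - random.random()) ** (-1.0 / mu) for _ in range(20000)]
s_values = [2.0 ** k for k in range(6)]    # s = 1, 2, 4, ..., 32
P = cumulative_distribution(sizes, s_values)
print(loglog_slope(s_values, P))           # should lie close to -mu
```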

Cumulative diagrams of the four coding and non-coding size
distributions are shown in Figs. 3 and 4, respectively. The non-extensive
analysis of these distributions follows in the next two sections.
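
The isolation of coding and non-coding segments mentioned earlier in this
section could be sketched as below. The text does not describe the
annotation format or the treatment of overlapping annotations, so the
interval representation (1-based, inclusive), the merging rule and the
function name are all hypothetical choices made for this example.

```python
def segment_sizes(chromosome_length, coding_intervals):
    """Split a chromosome into coding and non-coding segment sizes.

    coding_intervals: (start, end) pairs, 1-based and inclusive; overlapping
    or adjacent intervals are merged first (an assumption of this sketch).
    """
    merged = []
    for start, end in sorted(coding_intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    coding, non_coding = [], []
    position = 1                      # first base not yet assigned to a segment
    for start, end in merged:
        if start > position:
            non_coding.append(start - position)
        coding.append(end - start + 1)
        position = end + 1
    if position <= chromosome_length:
        non_coding.append(chromosome_length - position + 1)
    return coding, non_coding

# Toy example: two overlapping coding annotations and one separate one.
coding, non_coding = segment_sizes(1000, [(101, 200), (150, 260), (501, 600)])
```

The two returned lists are the inputs from which the size distributions and
their cumulative forms would then be computed.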

3 Coding DNA 

As we have already seen in Fig. 1, the coding size distributions have a
bell shape and their tails in the large scales fall relatively fast. To
give a quantitative account of the decay of the tails of the distributions
we plot the cumulative size distributions in Fig. 3 (solid lines).

To describe the shape of the four curves we use the phenomenological
non-extensive description of Eq. (5), and the corresponding curves are also
shown in the same figure (dashed lines). The theoretical lines approximate
the data well. The exponents q which describe the tails of the distributions
vary between 1.45 < q < 1.50 for the four chromosomes and their specific
values are given in Table 1. The non-extensive exponent q corresponds to
power law tails of the form of Eqs. (7) and (8), with exponent $\mu$ given by

$$ \mu = -\frac{1}{1-q}. \qquad (9) $$

Thus the tails of the coding size distributions present short range
correlations, since q < 1.50 gives $\mu > 2$. The exponent r, which
expresses the small scale characteristics, takes values between
5.2 < r < 5.6 for these chromosomes. Similar results have also been observed
for the other human chromosomes. The similarity of the two exponents in the
four chromosomes indicates that the same (or similar) dynamical processes
have created the coding parts of all chromosomes during evolution. This
dynamics must be of conservative type in short time scales, since coding DNA
changes very slowly (behaves as an almost-closed system) and this is
consistent with short range correlations [13]. As the human genome
annotation advances, we expect that the exponents r and q may be modified
and/or other exponents may be needed for a more complete statistical
non-extensive description.

Figure 3: The cumulative size distributions of coding DNA in chromosomes
6, 16, 17 and 19 (solid lines) and the non-linear fits using Eq. (5) (dashed
lines).

Table 1: Non-extensive exponents and parameters describing the coding
size distributions.
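
The conversion from the non-extensive exponent q to the tail exponent of
Eq. (9), and the resulting classification by the criterion of Eq. (8), can
be written out explicitly. The function names and the numerical tolerance
are illustrative assumptions; the q values are the range limits quoted in
the abstract.

```python
def tail_exponent(q):
    """mu = 1 / (q - 1), equivalent to Eq. (9); requires q > 1."""
    if q <= 1.0:
        raise ValueError("a power-law tail requires q > 1")
    return 1.0 / (q - 1.0)

def correlation_range(q, tol=1e-9):
    """Classify the tail using the criterion 0 < mu < 2 for long range."""
    mu = tail_exponent(q)
    if abs(mu - 2.0) <= tol:
        return "borderline"
    return "long range" if mu < 2.0 else "short range"

for q in (1.45, 1.50, 1.72):          # range limits quoted in the abstract
    print(q, round(tail_exponent(q), 3), correlation_range(q))
```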

4 Non-Coding DNA 

The cumulative size distributions of the non-coding DNA in the four
chromosomes are shown in Fig. 4 (solid lines). We observe that the four
distributions have as a common characteristic a long tail which can be
expressed in the form of a pure power law [12]. In the smaller scales the
decay is characterised by a different exponent which is very similar for the
four distributions.

To describe the shape of the four curves we use the phenomenological
non-extensive description of Eq. (5), and the corresponding curves are
shown in the same figure (dashed lines). The theoretical lines are very
faithful approximations to the data. The exponents q which describe the
long tails of the distributions are very close for the four chromosomes and
their corresponding values are given in Table 2. Their values vary between
1.50 < q < 1.72. The non-extensive exponent q corresponds to a power law
of the form of Eqs. (7) and (8), with exponent $\mu$ being within the bounds
0 < $\mu$ < 2, which indicates clear long range correlations. In the case of
chromosome 19, which (up to now) contains the highest coding percentage
amongst all human chromosomes, the value of $\mu$ calculated through Eq. (9)
is equal to 2, which is the borderline case between short and long range
correlations. On the other hand, the exponent r, which expresses the small
scale characteristics, takes values in the range 2.4 < r < 3.2 for these
chromosomes.

Figure 4: The cumulative size distributions of non-coding DNA in
chromosomes 6, 16, 17 and 19 (solid lines) and the non-linear fits using
Eq. (5) (dashed lines).
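
The exponents reported here come from non-linear fits of Eq. (5). As an
independent quick check on the tail exponent alone, one could use the Hill
(maximum-likelihood) estimator, which is a different technique and is not
the method of this paper. The sketch below applies it to synthetic power
law data; the threshold, the sample and the function names are assumptions.

```python
import math
import random

def hill_tail_exponent(sizes, s_min):
    """Hill (maximum-likelihood) estimate of mu for P(size >= s) ~ s**(-mu), s >= s_min."""
    tail = [s for s in sizes if s >= s_min]
    return len(tail) / sum(math.log(s / s_min) for s in tail)

def q_from_mu(mu):
    """Invert mu = 1 / (q - 1), i.e. Eq. (9) solved for q."""
    return 1.0 + 1.0 / mu

random.seed(1)
true_mu = 1.6                               # synthetic value chosen for the demo
# Sizes with P(size >= s) = s**(-true_mu) for s >= 1, by inverse-transform sampling.
sizes = [(1.0 - random.random()) ** (-1.0 / true_mu) for _ in range(50000)]
mu_hat = hill_tail_exponent(sizes, s_min=1.0)
print(mu_hat, q_from_mu(mu_hat))
```

On real size distributions the estimate depends on the choice of the
threshold s_min, which should lie well inside the tail region.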

The different small and large scale behaviors observed in the size
distribution of the non-coding DNA indicate that different dynamical
mechanisms are involved in the formation of small non-coding segments (which
are usually found as introns in the genes) and in the large non-coding
areas, or intergenic regions, which are found between genes and between
families of genes. The intergenic regions are extended non-coding regions
which can support extensive (massive) influx and outflux of genomic
material. Thus the ensemble of intergenic regions acts as an open system
which supports exchange with the environment. In open systems, out of
equilibrium, power laws and long range correlations, which may be regarded
as an expression of non-extensive or edge of chaos dynamics, naturally
emerge. Open aggregating systems, with influx mechanisms similar to the ones
involved in genomic evolution and which lead to long range correlations, are
presented in reference [13]. On the other hand, the non-coding segments
found within genes, also called introns, are less open to external
influences because they often include functional strings. Thus they behave
more like closed systems and their dynamics must be very different, which
is also expressed by the difference in the exponents q and r.

Table 2: Non-extensive exponents and parameters describing the non-coding
size distributions.

5 Conclusions 

We have studied the size distribution of all known coding and non-coding
sequences in human chromosomes 6, 16, 17 and 19. The first three were
selected as representatives of the most completely sequenced chromosomes,
while chromosome 19 has the highest coding percentage known to date. We
have found that the decay of the non-coding size distributions is consistent
with non-extensive dynamics as expressed by the non-linear Eq. (5). Moreover,
we have shown that the non-coding DNA presents two distinct regions: one
large scale region, related to the intergenic non-coding DNA, which presents
a decay exponent 1.5 < q < 1.72, and a second short scale region, related to
the introns (non-coding DNA within genes), which presents a decay exponent
r > 2.4. In contrast, short range correlations have been observed in the
tails of the coding distributions, with non-extensive exponents 1.45 < q <
1.50. This is consistent with earlier observed long (short) range
correlations in the non-coding (coding) size distributions of higher
eucaryotes. All other human chromosomes demonstrate similar characteristics.
A more detailed analysis could involve the use of more terms with
different exponents in Eq. (5) in order to capture more details, such as the
dynamical exponent which governs non-coding distances between families of
homologous genes (they may be governed by one of the current exponents,
q or r, or by a third one).

It is true that today the human chromosomes may be close to full
sequencing, but their complete annotation will take much longer. This means
that there are still coding sequences within the genome which have not been
discovered. Thus we expect that with the advancement of DNA annotation,
which is the next major step in genomics after sequencing, we will be able
to give more precise, final values to the exponents q and r for the human
genome. Also, the parallel study of the genomes of other organisms, as
they become sequenced and annotated, will allow for a comparative analysis
of genomic data between different classes of organisms.